The CENTER FOR THE STUDY OF EVALUATION OF INSTRUCTIONAL 
PROGRAMS is engaged in research that will yield new ideas 
and new tools capable of analyzing and evaluating instruc- 
tion. Staff members are creating new ways to evaluate con- 
tent of curricula, methods of teaching and the multiple 
effects of both on students. The CENTER is unique because 
of its access to Southern California’s elementary, second- 
ary and higher schools of diverse socio-economic levels 
and cultural backgrounds. 
COMMENTS ON PROFESSOR WILEY’S PAPER ENTITLED 
“DESIGN AND ANALYSIS OF EVALUATION STUDIES” 

Chester Harris 

We have come into the third day of this conference, and enough 
things have been said in various contexts to make it possible for 
me to point out some things that bear in general on Mr. Wiley’s 
paper, but still more generally on the whole set of papers. 

I think that the most important contribution that can be made 
at this point in the conference is to identify and enumerate what 
I regard as three critical issues in the design and analysis of 
evaluation studies suggested in these papers and discussions. The 
area of design and analysis is actively changing and developing, 
and most of us would be hard pressed to predict the extent to 
which these issues will be resolved or reformulated in the near 
future. The measurement problem in evaluation studies involves 
a situation in which we have an instructional package that is to 
be used with some group of human subjects, and then evaluated in 
terms of how good it is. This demands that we adopt some scheme 
for specifying what we mean by “good.” 

There appear to be three types of “goodness” for those who 
take the behavior of students as the relevant evidence. One is 
goodness defined as a level of performance; a second is goodness 
defined as change of performance in a specified direction; and a 
third is goodness defined as change of performance in a specified 
direction to a specified extent. Buried here are the questions of 
which behaviors are relevant and whether the observations that 
are made can become bases for inferences regarding learning as a 
result of the instructional package. This is an issue which Dr. 
Gagné posed for us earlier in the session. These three attitudes 
imply somewhat different measurement operations for any chosen type 
of performance. Let us leave this with the further acknowledgment 
that in any study many different types of performance may be re- 
garded as important dependent variables, and that the amount of 
work required to make preparations for an evaluation study may be 
extensive. 
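
As an illustration only, the three definitions of goodness can be
written as three separate decision rules. The function names, the
pre/post score lists, and the numeric standards below are assumptions
introduced for this sketch; none of them comes from the papers under
discussion.

```python
from statistics import mean


def good_by_level(post_scores, required_level):
    """Goodness as a level of performance: the final mean meets a standard."""
    return mean(post_scores) >= required_level


def good_by_direction(pre_scores, post_scores):
    """Goodness as change in a specified direction (assumed here: upward)."""
    gains = [post - pre for pre, post in zip(pre_scores, post_scores)]
    return mean(gains) > 0


def good_by_extent(pre_scores, post_scores, required_gain):
    """Goodness as change in a specified direction to a specified extent."""
    gains = [post - pre for pre, post in zip(pre_scores, post_scores)]
    return mean(gains) >= required_gain


if __name__ == "__main__":
    # Hypothetical scores for four students, before and after instruction.
    pre = [40, 55, 60, 45]
    post = [50, 60, 72, 58]
    print(good_by_level(post, required_level=65))
    print(good_by_direction(pre, post))
    print(good_by_extent(pre, post, required_gain=15))
```

The first rule needs only a measurement after instruction, while the
second and third need paired measurements before and after. The same
hypothetical package can pass one rule and fail another, which is one
way to see why the three definitions imply different measurement
operations.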

The reality that there may be many relevant dependent variables 
also suggests that appropriate designs for evaluation probably 
should be multivariate. This is the first issue which I wish to 
identify, the issue of univariate versus multivariate dependent 
variable studies. My strategy is not to resolve the issue but 
merely to enumerate the factors involved. 

Possibly the simplest design for an evaluation study is that 
which employs only one instructional package and attempts to assess 
its goodness for two or more categories or types of students. Here 
we employ stratifying variables: age, sex, intelligence level, 
residential region, etc., to define our groups of students, and 
then compare and contrast the various student performances. The 
intent of such a study is primarily descriptive (though tests of 
significance often are run): to define the goodness of the 
instructional package with respect to specified groups. This is 
a fixed-effects model, with the chosen levels of the stratifying 
variables being the only ones about which information is gained. 

Here there arises an issue which I will describe by extending the 
design so that more than one instructional package is used. I 
assume that we may retain one or more stratifying variables as well, 
and thus have a reasonably complicated design. I will not, however, 
complicate it by introducing repeated measurements. Such a design 
has as its intent a comparison among instructional packages for 
various groups and sub-groups. I repeat that in practice this is 
a fixed model; for we seem absolutely unable to define a population 
of instructional packages, and, even if we could, to be quite un- 
willing to select at random a set of instructional packages to study. 
Instead, we select the packages arbitrarily and deliberately; this is 
a fixed effect. 

A design such as this has limitations that are inherent in all 
hypothesis testing. Among them is the familiar problem posed by the 
reasonable assertion that no sharp hypothesis can possibly be true. 
Testing such a hypothesis is merely an exercise in testmanship since 
the outcome depends heavily upon the manipular flexibility of the 
test. 



It is perfectly reasonable to assert that no two instructional 
packages can possibly have identically the same effect; thus the 
testing of the hypothesis that two or more such packages have the 
same mean effect can be viewed as relatively unimportant. This 
represents my attitude toward the decision theoretic approach 
which has been mentioned over and over again at this conference. 
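
A small numerical sketch may make the point about sharp hypotheses
concrete. It assumes two equal-sized groups, a known common standard
deviation, and a true difference between packages that is small but
not zero; the function names and the figures used are hypothetical.
Under those assumptions the expected z statistic grows with the square
root of the sample size, so whether the hypothesis of "no difference"
is rejected depends mainly on how many students are tested.

```python
import math


def expected_z(true_diff, sd, n_per_group):
    """Expected z statistic for a two-group mean comparison with known SD."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return true_diff / standard_error


def n_needed_to_reject(true_diff, sd, critical_z=1.96):
    """Smallest per-group n at which the expected z reaches critical_z."""
    return math.ceil(2.0 * (critical_z * sd / true_diff) ** 2)


if __name__ == "__main__":
    # Assumed true difference: one twentieth of a standard deviation.
    for n in (100, 1000, 10000):
        print(n, round(expected_z(0.05, 1.0, n), 2))
    print(n_needed_to_reject(0.05, 1.0))
```

With the assumed difference of 0.05 standard deviations, the expected
statistic is far below the conventional 1.96 at 100 students per group
and well above it at 10,000, although the packages differ by exactly
the same amount in both cases.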

Those who criticize hypothesis testing urge that we use estimation 
procedures instead. The question of what kind of estimation 
procedure is useful here is an important one. Some interest exists 
in developing an analogue of response surface methodology for 
evaluation studies. It is an analogue, since the elements of instruc- 
tion packages that can be identified often exist in only a few 
discrete rather than continuously ordered forms. This creates some 
problems with the statistics, but in time these problems may be made 
manageable. 

The response surface design attempts to vary inputs (elements 
of instruction) to the end of identifying an optimum or maximum 
output performance. This is quite a different approach to evalua- 
tion studies. The choice of this approach as opposed to the more 
conventional fixed model constitutes a second important issue. 
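
The response surface idea can be sketched in its simplest form: one
input, three levels, and a quadratic passed through the three observed
mean outputs. Everything in the sketch is an assumption made for
illustration, including the function names, the choice of a single
continuous input (minutes of practice), and the data. As noted above,
real elements of instruction are often discrete, so this is at best an
analogue.

```python
def quadratic_through(p0, p1, p2):
    """Return (a, b, c) of y = a*x**2 + b*x + c through three points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    # Newton divided differences give the leading coefficient directly.
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    a = (d12 - d01) / (x2 - x0)
    b = d01 - a * (x0 + x1)
    c = y0 - a * x0 ** 2 - b * x0
    return a, b, c


def input_at_maximum(a, b):
    """Input level at which the fitted quadratic peaks."""
    if a >= 0:
        raise ValueError("fitted surface has no interior maximum")
    return -b / (2.0 * a)


if __name__ == "__main__":
    # Hypothetical (minutes of practice, mean test score) observations.
    a, b, c = quadratic_through((10, 60), (20, 70), (30, 64))
    print(input_at_maximum(a, b))
```

The contrast with the fixed model is that the question asked is not
which of the three tested levels is best, but where between or beyond
them the output would be greatest.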

Let me raise a third issue which is often associated with a 
Bayesian point of view in statistics. The fact that we tend to 
interpret every study as if it were being done for the first time 
should make us uneasy, even though we still cannot agree on how 
prior information should be incorporated into our analysis. 
Actually, there often are relevant prior findings that remain 
unused. 

I am reminded of how we behave in directing dissertations. 
We always insist on a summary of previous findings in an early 
chapter, but we would be horrified if the student tried to inte- 
grate them numerically with his findings. The issue here is the 
extent to which, in any evaluation study, the design and analysis 
will ignore all the possible prior distributions. 

A modification in practice, namely learning to take into 
account the prior information, might be the one that would most 
improve the design and analysis of evaluation studies. 
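
One textbook way of integrating earlier findings numerically with a
new result is the normal-normal update, in which each source is
weighted by its precision (the reciprocal of its variance). The sketch
below is only one possibility among many and is not a procedure
proposed at this conference. It assumes that the earlier findings can
be summarized by a normal prior, that the sampling variance of the new
estimate is known, and the names and numbers are hypothetical.

```python
def combine_normal(prior_mean, prior_var, data_mean, data_var):
    """Posterior mean and variance for a normal prior and normal estimate."""
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / data_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_precision * prior_mean + data_precision * data_mean
    )
    return post_mean, post_var


if __name__ == "__main__":
    # Hypothetical: earlier studies suggest a gain of 5 points (variance 4);
    # the new study estimates a gain of 9 points (variance 4).
    print(combine_normal(5.0, 4.0, 9.0, 4.0))
```

When the two sources are equally precise, as in the example, the
combined estimate falls midway between them and its variance is
halved. A study analyzed as if it were the first of its kind would
instead report the new estimate alone.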



